server: continuous performance monitoring and PR comment #6283

phymbert · 2024-03-24T16:34:00Z

Motivation

In the context of:

server: bench: continuous performance testing #6233

Start the existing k6 benchmark script on the Azure node using:

JIT GitHub docker runner ggml-org/ci#2

Then add a PR comment, example:

Attach results, image and logs as job artefacts:

Set the commit status with minimized JSON results for later reprocessing.

Tested in:

TODO:

Merge ci: add install-docker.sh ggml-org/ci#1
Merge JIT GitHub docker runner ggml-org/ci#2
@ggerganov create an imgur secret named IMGUR_CLIENT_ID, see instructions
@ggerganov PM a classic repo token with access rights: actions:read and workflows (!!! The required token is too powerful - it can be used to delete the repo at the very least), a less risky approach to be implemented later on
start the github runner manager on this repo
Make prometheus image request
Add comment in the PR with results
Random 503 errors on imgur, fixed by adding a mermaid chart as a fallback
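To make the commit status step from the description more concrete, here is a minimal sketch, not the code from this PR. It builds a request for the GitHub commit status endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`) and puts the minified JSON in the `description` field. The context name, the result keys and the 140-character cap are assumptions for illustration.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
# Assumption: the status description is limited to roughly 140 characters.
MAX_DESCRIPTION = 140


def minify_results(results: dict) -> str:
    """Serialize results without whitespace so they fit in a short field."""
    return json.dumps(results, separators=(",", ":"), sort_keys=True)


def build_status_request(repo: str, sha: str, token: str, results: dict,
                         context: str = "bench-server-baseline") -> urllib.request.Request:
    """Build (but do not send) the commit status request."""
    description = minify_results(results)
    if len(description) > MAX_DESCRIPTION:
        raise ValueError(
            f"minified results are {len(description)} chars, over {MAX_DESCRIPTION}"
        )
    payload = {"state": "success", "context": context, "description": description}
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical keys; the values come from the bot comment in this thread.
    req = build_status_request(
        "ggerganov/llama.cpp",
        "4a6bfa92c5cfa12efa264c4c145dd91e6c8aba60",
        "<token>",
        {"iter": 522, "pp": 203.5, "tg": 129.19},
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send it.
```

Keeping the payload in the status itself means a later job can list the statuses of each commit and rebuild a history without needing the artefacts.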

mscheong01 · 2024-03-26T02:02:27Z

.github/workflows/bench.yml

+    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
+  pull_request:
+    types: [opened, synchronize, reopened]
+    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']


How about excluding examples/ subdirectories except for examples/server? It could help reduce unneeded runs

Let's do it in another PR if you don't mind.

mscheong01 · 2024-03-26T02:11:05Z

.github/workflows/bench.yml

+  schedule:
+    -  cron: '04 2 * * *'


Do we need this scheduled run? If so, how will we view the results?

At the moment, it will do the steps not related to PR: commit status and upload artefact. I will later process all commit checks statuses to show performance improvements day after day.

That sounds awesome 👍 It would also be cool if we pile up the daily performance results somewhere and visualize the performance improvement.
Also, if the scheduled run's role comes to differ too much from the PR-based runs, consider making it a separate workflow.

Yes, I want to do something like this, probably stored on GH Pages:

https://home.apache.org/~mikemccand/lucenebench/indexing.html

But it will require a little time and logic to reprocess previous commits, taking into account that the parameters have changed :/

Don't think we should put much effort in reprocessing previous commits. Better to focus just on the new versions from now on

phymbert · 2024-03-26T07:41:17Z

@ggerganov Hi Georgi, I think it is OK for a first version, please review the changes

ggerganov · 2024-03-26T08:35:31Z

Cannot create a token with organization_self_hosted_runners:write since llama.cpp is not part of an organization

phymbert · 2024-03-26T08:54:25Z

Cannot create a token with organization_self_hosted_runners:write since llama.cpp is not part of an organization

@ggerganov I see, just creating a classic token, as I did on the fork, will work. Also please add: workflow, write:discussion, repo:status, repo_deployment and public_repo.

phymbert · 2024-03-26T09:01:31Z

Cannot create a token with organization_self_hosted_runners:write since llama.cpp is not part of an organization

@ggerganov I see, just creating a classic token, as I did on the fork, will work. Also please add: workflow, write:discussion, repo:status, repo_deployment and public_repo.

Also, probably it's better if you try to start the github runner manager yourself on the Azure T4 node:

git clone https://github.com/ggml-org/ci.git
cd ci
git remote add phymbert https://github.com/phymbert/ci.git
git fetch phymbert
git checkout hp/github-runner
./start-github-runner-manager.sh ggerganov/llama.cpp $TOKEN Standard_NC4as_T4_v3

ggerganov · 2024-03-26T09:54:22Z

A classic token with workflows requires full access to the repo section, so it's not an option.

I tried to make a fine-grained token with the following config:

But I get an error:

ggml-github-runners-manager
ggml-ci: starting github runner manager on repo=ggerganov/llama.cpp label=Standard_NC4as_T4_v3...
ggml-ci: github runner manager started.
ggml-ci: github runner manager logs:
         CTRL+C to stop logs pulling
ggml-ci: fetching workflows of ggerganov/llama.cpp ...
ggml-ci:     ggml-runner-90970032-23091437579-pull_request-1711446732 triggered for workflow_name=Benchmark
invalid JIT response code: 403
    {"message":"Resource not accessible by personal access token","documentation_url":"https://docs.github.com/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository"}

phymbert · 2024-03-26T09:59:24Z

A classic token with workflows requires full access to the repo section, so it's not an option.
invalid JIT response code: 403
{"message":"Resource not accessible by personal access token","documentation_url":"https://docs.github.com/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository"}

Yes, I did not manage it with a fine-grained token either. Up to you, I do not see another option.

ngxson

LGTM, only needs minor changes

.github/workflows/bench.yml

ngxson · 2024-03-26T09:59:28Z

.github/workflows/bench.yml

+          wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
+          tar xzf prometheus*.tar.gz --strip-components=1
+          ./prometheus --config.file=examples/server/bench/prometheus.yml &
+          while ! nc -z localhost 9090; do


Maybe we should add a timeout here, just in case something goes wrong

The workflow will be killed after a while. If you don't mind it can be added later on

Yeah, it's not very important, but I still prefer not to rely on the CI timeout because it can be long (usually minutes or hours). We should add a timeout here, of 10 seconds for example.
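For illustration, a bounded wait could look like the sketch below. It is written in Python rather than as the shell loop used in the workflow, and the function name, polling interval and default timeout are assumptions, not part of this PR.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 10.0,
                  interval_s: float = 0.2) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Returns True if the port accepted a connection in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval_s)
    return False
```

A caller would run `wait_for_port("localhost", 9090, timeout_s=10.0)` and fail the step when it returns `False`, so a Prometheus instance that never starts is reported within seconds instead of at the job timeout.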

Co-authored-by: Xuan Son Nguyen <[email protected]>

phymbert · 2024-03-26T12:43:15Z

A classic token with workflows requires full access to the repo section, so it's not an option.

@ggerganov Please note that the default GitHub runner we are currently using has a global token for the whole repo (GITHUB_TOKEN), so adding another self-hosted runner with the same privileges does not hurt.
Also, what are the risks? I see:

people who have access to the VM can see the token
~~- people changing a workflow executed on this runner labels can do any action on the repo. But it should be someone allowed to trigger workflow run, so with write access.~~

We can add some checks in the runner manager to see on which branch the workflow will run, by which author, and with what minimum approval.

Am I missing something? Tell me if you want me to implement some security changes. I agree it might still be a work in progress.

@ngxson if you have a better idea, it would be very welcome.

EDIT: the token is only used in the manager to get a jitconfig; the runner will use the GITHUB_TOKEN secret like all other runners. So feel free to remove my SSH public key.

ngxson · 2024-03-26T13:00:53Z

The github token used by start-github-runner-manager.sh is only to generate JIT config as you said, so it's only visible to the manager container, but not the ephemeral runners created by the manager. In short, developers who trigger a new workflow cannot read the token.

The /actions/runners/generate-jitconfig endpoint requires administration:write as described in this page, which should be equivalent to "Administration" in a fine-grained token. You need to add at least one repo to the token in order to see this list (I couldn't test it so I'm not sure if it works):

But anyway, this permission is still very powerful, because it allows changing the list of collaborators, the description of the repo, and so on. So the token should still be kept secure.

And yes, anyone with access to the VM can see the token, but I don't think it's a problem. GitLab Runner works the same way (the token needs to be hard-coded inside the VM). Also, for now only you and @ggerganov have access to the VM, so there's no problem.
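For reference, the JIT config call discussed above can be sketched roughly as follows. This is not the code of the runner manager; it only builds the request for `POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig`. The function name and the default runner group id are assumptions, while the runner name and label in the example are copied from the log earlier in this thread.

```python
import json
import urllib.request
from typing import List


def build_jitconfig_request(repo: str, token: str, runner_name: str,
                            labels: List[str],
                            runner_group_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a request for a just-in-time runner config."""
    payload = {
        "name": runner_name,
        "runner_group_id": runner_group_id,  # assumption: the default group
        "labels": labels,
        "work_folder": "_work",
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/actions/runners/generate-jitconfig",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # This is the token that must stay inside the manager container.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    req = build_jitconfig_request(
        "ggerganov/llama.cpp",
        "<token>",
        "ggml-runner-90970032-23091437579-pull_request-1711446732",
        ["Standard_NC4as_T4_v3"],
    )
    print(req.get_method(), req.full_url)
    # Sending it with urllib.request.urlopen(req) fails with HTTP 403 when the
    # token lacks the required permission, as in the log above.
```

Only the manager ever holds this token; the ephemeral runner receives the encoded JIT config from the response and nothing else.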

ggerganov · 2024-03-26T13:10:04Z

The required token is too powerful - it can be used to delete the repo at the very least. So I'm hesitant to give access to it

It seems that the only option that we have atm is to limit access to the node just to myself and start the self-hosted manager. Will try to do this today or tomorrow

phymbert · 2024-03-26T13:41:43Z

The required token is too powerful - it can be used to delete the repo at the very least. So I'm hesitant to give access to it

It seems that the only option that we have atm is to limit access to the node just to myself and start the self-hosted manager. Will try to do this today or tomorrow

Thanks Georgi, I am definitely all in for the least-privilege approach, especially when there is CI automation; human mistakes always happen.

If you have time, could you simply delete the work VM and create a fresh one,
in order to verify that all the installation scripts are working in:

JIT GitHub docker runner ggml-org/ci#2

I am just imagining another option:

fork and sync llama.cpp in ggml-org
start the runner in ggml-org with an appropriate fine-grained token
in the workflow, change the commit status target to ggerganov/llama.cpp

Would it work? Less convenient but more secure.

ngxson · 2024-03-26T15:11:50Z

start the runner in ggml-org with an appropriate fine-grained token

Yeah, seems like a good idea. Just a reminder that you can change the repository in the actions/checkout@v3 step, so that the runner pulls code directly from llama.cpp (it's a public repo anyway, so anyone can do git clone); no need to sync it in ggml-org.

phymbert · 2024-03-26T15:15:02Z

start the runner in ggml-org with an appropriate fine-grained token

Yeah, seems like a good idea. Just a reminder that you can change the repository in the actions/checkout@v3 step, so that the runner pulls code directly from llama.cpp (it's a public repo anyway, so anyone can do git clone); no need to sync it in ggml-org.

The problem is how the workflow would be triggered this way. I need to think a little more about whether it's possible to schedule the workflow on another repo without a git sync.

ngxson · 2024-03-26T15:28:51Z

My idea is: from llama.cpp, we can send a request to ggml-org to tell it to trigger the pipeline. Imagine that it's a bit like our "Publish Docker image" step, which makes a call to the registry outside of the runner.

This approach requires llama.cpp to keep a ggml-org token as a secret, but the token only has the actions:write permission, so it shouldn't be a big problem. An example can be found here
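One way to picture this cross-repo trigger is the GitHub "create a workflow dispatch event" endpoint (`POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches`). The sketch below only builds the request. The target repo, the workflow file name and the `sha` input are hypothetical, and the target workflow would have to declare `workflow_dispatch` with a matching input.

```python
import json
import urllib.request


def build_dispatch_request(target_repo: str, workflow_file: str, token: str,
                           ref: str, inputs: dict) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for another repo."""
    payload = {
        "ref": ref,  # branch or tag of the target repo to run the workflow on
        # Input values are sent as strings.
        "inputs": {key: str(value) for key, value in inputs.items()},
    }
    url = (
        f"https://api.github.com/repos/{target_repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # A token scoped to actions:write on the target repo only.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical target repo, workflow file and input name.
    req = build_dispatch_request(
        "ggml-org/ci", "bench.yml", "<token>", "master",
        {"sha": "4a6bfa92c5cfa12efa264c4c145dd91e6c8aba60"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With this shape, llama.cpp would only store a token that can start workflows in the other repository, while the token able to create runners stays on the ggml-org side.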

github-actions · 2024-03-27T17:33:10Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 522 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=8956.63ms p(90)=25933.28ms fails=0, finish reason: stop=522 truncated=0
Prompt processing (pp): avg=237.61tk/s p(90)=698.3tk/s total=203.5tk/s
Token generation (tg): avg=97.26tk/s p(90)=257.93tk/s total=129.19tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/bench/workflow commit=4a6bfa92c5cfa12efa264c4c145dd91e6c8aba60

Time series

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 522 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1711566053 --> 1711566683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 556.13, 556.13, 556.13, 556.13, 556.13, 465.15, 465.15, 465.15, 465.15, 465.15, 483.88, 483.88, 483.88, 483.88, 483.88, 540.38, 540.38, 540.38, 540.38, 540.38, 555.07, 555.07, 555.07, 555.07, 555.07, 553.7, 553.7, 553.7, 553.7, 553.7, 567.62, 567.62, 567.62, 567.62, 567.62, 576.91, 576.91, 576.91, 576.91, 576.91, 591.9, 591.9, 591.9, 591.9, 591.9, 592.78, 592.78, 592.78, 592.78, 592.78, 614.57, 614.57, 614.57, 614.57, 614.57, 643.54, 643.54, 643.54, 643.54, 643.54, 648.49, 648.49, 648.49, 648.49, 648.49, 675.49, 675.49, 675.49, 675.49, 675.49, 656.89, 656.89, 656.89, 656.89, 656.89, 658.39, 658.39, 658.39, 658.39, 658.39, 656.66, 656.66, 656.66, 656.66, 656.66, 672.61, 672.61, 672.61, 672.61, 672.61, 670.33, 670.33, 670.33, 670.33, 670.33, 670.6, 670.6, 670.6, 670.6, 670.6, 669.1, 669.1, 669.1, 669.1, 669.1, 669.62, 669.62, 669.62, 669.62, 669.62, 665.84, 665.84, 665.84, 665.84, 665.84, 676.65, 676.65, 676.65, 676.65, 676.65, 675.42, 675.42, 675.42, 675.42, 675.42, 676.08, 676.08, 676.08, 676.08, 676.08, 683.13, 683.13, 683.13, 683.13, 683.13, 681.71, 681.71, 681.71, 681.71, 681.71, 681.43, 681.43, 681.43, 681.43, 681.43, 682.11, 682.11, 682.11, 682.11, 682.11, 685.05, 685.05, 685.05, 685.05, 685.05, 684.43, 684.43, 684.43, 684.43, 684.43, 686.74, 686.74, 686.74, 686.74, 686.74, 690.8, 690.8, 690.8, 690.8, 690.8, 698.35, 698.35, 698.35, 698.35, 698.35, 698.63, 698.63, 698.63, 698.63, 698.63, 698.86, 698.86, 698.86, 698.86, 698.86, 697.45, 697.45, 697.45, 697.45, 697.45, 697.08, 697.08, 697.08, 697.08, 697.08, 697.46, 697.46, 697.46, 697.46, 697.46, 695.86, 695.86, 695.86, 695.86, 695.86, 699.79, 699.79, 699.79, 699.79, 699.79, 698.11, 698.11, 698.11, 698.11, 698.11, 696.16, 696.16, 696.16, 696.16, 696.16, 691.65, 691.65, 691.65, 691.65, 691.65, 689.65, 689.65, 689.65, 689.65, 689.65, 688.07, 688.07, 688.07, 688.07, 688.07, 686.51, 686.51, 686.51, 686.51, 686.51, 679.97, 679.97, 
679.97, 679.97, 679.97, 682.74, 682.74, 682.74, 682.74, 682.74, 682.75, 682.75, 682.75, 682.75, 682.75, 680.31, 680.31, 680.31, 680.31, 680.31, 683.18, 683.18, 683.18, 683.18, 683.18, 683.45, 683.45, 683.45, 683.45, 683.45, 684.14, 684.14, 684.14, 684.14, 684.14, 683.68, 683.68, 683.68, 683.68, 683.68, 686.19, 686.19, 686.19, 686.19, 686.19, 687.9, 687.9, 687.9, 687.9, 687.9, 688.43, 688.43, 688.43, 688.43, 688.43, 688.97, 688.97, 688.97, 688.97, 688.97, 688.75]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 522 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1711566053 --> 1711566683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.56, 24.56, 24.56, 24.56, 24.56, 21.0, 21.0, 21.0, 21.0, 21.0, 20.73, 20.73, 20.73, 20.73, 20.73, 19.34, 19.34, 19.34, 19.34, 19.34, 19.5, 19.5, 19.5, 19.5, 19.5, 19.84, 19.84, 19.84, 19.84, 19.84, 20.51, 20.51, 20.51, 20.51, 20.51, 20.57, 20.57, 20.57, 20.57, 20.57, 20.64, 20.64, 20.64, 20.64, 20.64, 20.52, 20.52, 20.52, 20.52, 20.52, 20.48, 20.48, 20.48, 20.48, 20.48, 20.38, 20.38, 20.38, 20.38, 20.38, 20.05, 20.05, 20.05, 20.05, 20.05, 19.39, 19.39, 19.39, 19.39, 19.39, 18.83, 18.83, 18.83, 18.83, 18.83, 18.74, 18.74, 18.74, 18.74, 18.74, 18.89, 18.89, 18.89, 18.89, 18.89, 19.04, 19.04, 19.04, 19.04, 19.04, 18.88, 18.88, 18.88, 18.88, 18.88, 18.79, 18.79, 18.79, 18.79, 18.79, 18.69, 18.69, 18.69, 18.69, 18.69, 18.48, 18.48, 18.48, 18.48, 18.48, 18.45, 18.45, 18.45, 18.45, 18.45, 18.48, 18.48, 18.48, 18.48, 18.48, 18.4, 18.4, 18.4, 18.4, 18.4, 18.48, 18.48, 18.48, 18.48, 18.48, 18.52, 18.52, 18.52, 18.52, 18.52, 18.47, 18.47, 18.47, 18.47, 18.47, 18.47, 18.47, 18.47, 18.47, 18.47, 18.51, 18.51, 18.51, 18.51, 18.51, 18.6, 18.6, 18.6, 18.6, 18.6, 18.66, 18.66, 18.66, 18.66, 18.66, 18.78, 18.78, 18.78, 18.78, 18.78, 18.86, 18.86, 18.86, 18.86, 18.86, 18.77, 18.77, 18.77, 18.77, 18.77, 18.79, 18.79, 18.79, 18.79, 18.79, 18.65, 18.65, 18.65, 18.65, 18.65, 18.59, 18.59, 18.59, 18.59, 18.59, 18.62, 18.62, 18.62, 18.62, 18.62, 18.67, 18.67, 18.67, 18.67, 18.67, 18.69, 18.69, 18.69, 18.69, 18.69, 18.69, 18.69, 18.69, 18.69, 18.69, 18.55, 18.55, 18.55, 18.55, 18.55, 18.55, 18.55, 18.55, 18.55, 18.55, 18.18, 18.18, 18.18, 18.18, 18.18, 18.15, 18.15, 18.15, 18.15, 18.15, 18.08, 18.08, 18.08, 18.08, 18.08, 17.76, 17.76, 17.76, 17.76, 17.76, 17.38, 17.38, 17.38, 17.38, 17.38, 17.39, 17.39, 17.39, 17.39, 17.39, 17.41, 17.41, 17.41, 17.41, 17.41, 17.48, 17.48, 17.48, 17.48, 17.48, 17.52, 17.52, 17.52, 17.52, 17.52, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 17.54, 
17.54, 17.54, 17.54, 17.54, 17.48, 17.48, 17.48, 17.48, 17.48, 17.46, 17.46, 17.46, 17.46, 17.46, 17.5, 17.5, 17.5, 17.5, 17.5, 17.55, 17.55, 17.55, 17.55, 17.55, 17.64]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 522 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1711566053 --> 1711566683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.25, 0.25, 0.18, 0.18, 0.18, 0.18, 0.18, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.23, 0.23, 0.23, 0.23, 0.23, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.12, 0.12, 0.12, 0.12, 0.12, 0.22, 0.22, 0.22, 0.22, 0.22, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.15, 0.15, 0.15, 0.15, 0.15, 0.28, 0.28, 0.28, 0.28, 0.28, 0.11, 0.11, 0.11, 0.11, 0.11, 0.17, 0.17, 0.17, 0.17, 0.17, 0.12, 0.12, 0.12, 0.12, 0.12, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.27, 0.27, 0.27, 0.27, 0.27, 0.3, 0.3, 0.3, 0.3, 0.3, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.09, 0.09, 0.09, 0.09, 0.09, 0.18, 0.18, 0.18, 0.18, 0.18, 0.32, 0.32, 0.32, 0.32, 0.32, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.14, 0.14, 0.14, 0.14, 0.14, 0.07, 0.07, 0.07, 0.07, 0.07, 0.15, 0.15, 0.15, 0.15, 0.15, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.13, 0.13, 0.13, 0.13, 0.13, 0.19, 0.19, 0.19, 0.19, 0.19, 0.17, 0.17, 0.17, 0.17, 0.17, 0.12, 0.12, 0.12, 0.12, 0.12, 0.17, 0.17, 0.17, 0.17, 0.17, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.32, 0.32, 0.32, 0.32, 0.32, 0.48, 0.48, 0.48, 0.48, 0.48, 0.49, 0.49, 0.49, 0.49, 0.49, 0.46, 0.46, 0.46, 0.46, 0.46, 0.5, 0.5, 0.5, 0.5, 0.5, 0.54, 0.54, 0.54, 0.54, 0.54, 0.38, 0.38, 0.38, 0.38, 0.38, 0.09, 0.09, 0.09, 0.09, 0.09, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.24, 0.24, 0.24, 0.24, 0.24, 0.28, 0.28, 0.28, 0.28, 0.28, 0.18, 0.18, 0.18, 0.18, 0.18, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.16, 0.16, 0.16, 0.16, 0.16, 0.19]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 522 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1711566053 --> 1711566683
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0]

phymbert · 2024-03-27T17:39:50Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 482 iterations 🚀

Concurrent users: 8

HTTP request : avg=9739.4ms p(90)=26438.48ms passes=482reqs fails=0reqs

Prompt processing (pp): avg=245.09tk/s p(90)=741.1tk/s total=193.57tk/s

Token generation (tg): avg=99.01tk/s p(90)=278.66tk/s total=129.61tk/s

Finish reason : stop=482reqs truncated=0

ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/bench/workflow commit=1c1f8769947ef6e483809beec87b59051cf3e435

@ggerganov please review the added comment and I think we are good

phymbert · 2024-03-27T17:55:09Z

@ggerganov please review the added comment and I think we are good

It looks like we have our baseline ^^ but I am wondering why this VM is slower than before:

New VM: 482 iterations: server: continuous performance monitoring and PR comment #6283 (comment)
previous VM: 550 iterations: WIP server: bench: init phymbert/llama.cpp#1 (comment)

Is it the same CPU/RAM/T4?
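To put a number on that gap, here is a small sketch of the arithmetic. The helper name is an assumption, and with a single run per VM it says nothing about how much of the difference is run-to-run noise.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from baseline to new, as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Iteration counts quoted above: previous VM 550, new VM 482.
    change = relative_change(550, 482)
    print(f"{change:.1f}%")  # prints "-12.4%"
```

So the new VM completed about 12% fewer iterations over the same duration.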

ggerganov · 2024-03-27T18:16:41Z

Yes, it's Standard_NC4as_T4_v3 as before. Not sure why there is a difference

Btw, I'm not super confident about the PR comments - it might get annoying to have those on every PR. For now let's put everything after the bullet points in <details> so they are more compact. And we will see if we want to keep them depending on how useful/distracting they are

phymbert · 2024-03-27T18:56:16Z

Yes, it's Standard_NC4as_T4_v3 as before. Not sure why there is a difference

Btw, I'm not super confident about the PR comments - it might get annoying to have those on every PR. For now let's put everything after the bullet points in <details> so they are more compact. And we will see if we want to keep them depending on how useful/distracting they are

Done, note there is only one comment per PR
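As an illustration of that layout, here is a minimal sketch of a comment renderer that keeps the bullet points visible and folds everything else into `<details>` blocks. The function name and argument shapes are assumptions and this is not the script used by the workflow.

```python
from typing import Dict, List


def render_pr_comment(title: str, bullets: List[str],
                      sections: Dict[str, str]) -> str:
    """Render a Markdown comment: title, bullet points, then folded sections."""
    lines = [title, ""]
    lines += [f"- {bullet}" for bullet in bullets]
    for summary, body in sections.items():
        # The blank line after </summary> lets the body render as Markdown.
        lines += ["", "<details>", f"<summary>{summary}</summary>", "",
                  body, "", "</details>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    comment = render_pr_comment(
        "llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 522 iterations",
        ["Concurrent users: 8, duration: 10m"],
        {"Time series": "(charts go here)"},
    )
    print(comment)
```

Each key of `sections` becomes the clickable summary, so the charts stay available without taking space in the PR timeline.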

phymbert mentioned this pull request Mar 24, 2024

server: bench: continuous performance testing #6233

Closed

16 tasks

server: bench: init

4146960

phymbert force-pushed the hp/server/bench/workflow branch from 82c1e40 to 4146960 March 25, 2024 20:12

server: bench: reduce list of GPU nodes

799317b

phymbert requested a review from ngxson March 25, 2024 20:17

phymbert marked this pull request as ready for review March 25, 2024 20:17

phymbert requested a review from ggerganov March 25, 2024 20:17

phymbert added the performance, server/webui and need feedback labels Mar 25, 2024

phymbert changed the title ~~server: bench: init~~ server: continuous performance monitoring and PR comment Mar 25, 2024

phymbert added 2 commits March 25, 2024 21:44

server: bench: fix graph, fix output artifact

5c0b2a2

ci: bench: add mermaid in case of image cannot be uploaded

93434fd

mscheong01 reviewed Mar 26, 2024

phymbert added 5 commits March 26, 2024 08:09

ci: bench: more resilient, more metrics

5c2f8e6

ci: bench: trigger build

225f63b

ci: bench: fix duration

fb3b2f5

ci: bench: fix typo

bff4644

ci: bench: fix mermaid values, markdown generated

337c13b

ngxson approved these changes Mar 26, 2024

typo on the step name

1c1f876

Co-authored-by: Xuan Son Nguyen <[email protected]>

ngxson reviewed Mar 26, 2024

ci: bench: trailing spaces

30195d7

ci: bench: move images in a details section

fce86c3

ggerganov approved these changes Mar 27, 2024

ci: bench: reduce bullet point size

4a6bfa9

phymbert merged commit a016026 into master Mar 27, 2024
27 of 28 checks passed

phymbert deleted the hp/server/bench/workflow branch March 27, 2024 19:26
